AIML Module Project - ENSEMBLE TECHNIQUES

PROJECT BASED

DOMAIN: Telecom

CONTEXT: A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. You can analyse all relevant customer data and develop focused customer retention programs.

DATA DESCRIPTION: Each row represents a customer, and each column contains a customer attribute described in the column metadata. The data set includes information about:

PROJECT OBJECTIVE: Build a model that will help identify potential customers who have a higher probability of churning. This helps the company understand the pain points and patterns of customer churn and sharpens its focus on strategising customer retention.

Steps to the project:

  1. Import and warehouse data:
    • Import all the given datasets. Explore shape and size.
    • Merge all datasets onto one and explore final shape and size.
  2. Data cleansing:
    • Missing value treatment
    • Convert categorical attributes to continuous using relevant functional knowledge
    • Drop attribute/s if required using relevant functional knowledge
    • Automate all the above steps
  3. Data analysis & visualisation:
    • Perform detailed statistical analysis on the data.
    • Perform a detailed univariate, bivariate and multivariate analysis with appropriate detailed comments after each analysis.
  4. Data pre-processing:
    • Segregate predictors vs target attributes
    • Check for target balancing and fix it if found imbalanced.
    • Perform train-test split.
    • Check if the train and test data have similar statistical characteristics when compared with original data.
  5. Model training, testing and tuning:
    • Train and test all ensemble models taught in the learning module.
    • Suggestion: Use the standard ensembles available. You can also design your own ensemble technique using weak classifiers.
    • Display the classification accuracies for train and test data.
    • Apply all the possible tuning techniques to train the best model for the given data.
    • Suggestion: Try all feasible hyperparameter combinations to extract the best accuracies.
    • Display and compare all the models designed with their train and test accuracies.
    • Select the final best trained model along with your detailed comments for selecting this model.
    • Pickle the selected model for future use.
  6. Conclusion and improvement:
    • Write your conclusion on the results.
    • Provide detailed suggestions for improving the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the telecom operator, to enable better data analysis in future.

Importing Required Python Modules and Libraries

Here we are importing all the libraries and modules that are needed for the whole project in a single cell.
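A typical single import cell for this project might look like the sketch below; the exact set of imports depends on which cells follow, so treat this as an assumed superset rather than the definitive list.

```python
# Core data handling
import numpy as np
import pandas as pd

# Visualisation
import matplotlib.pyplot as plt
import seaborn as sns

# Pre-processing and model selection
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.model_selection import train_test_split, GridSearchCV

# Baseline and ensemble models
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, RandomForestClassifier,
                              VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score, classification_report

# Model persistence
import pickle
```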


1. Import and Warehouse Data:

* Import all the given Datasets and Explore Shape and Size.

* Merge all Datasets onto One and Explore Final Shape and Size.

Since we have chosen the second approach, i.e. the 'Data set for direct import using Pandas', we have a single file that includes all the rows and columns. Hence there is no need to merge the data.
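The direct import and shape/size exploration can be sketched as follows. The real filename is an assumption; to keep the sketch self-contained we use a tiny inline sample with the same column layout.

```python
import io
import pandas as pd

# In the project this would be a direct read, e.g.:
#   df = pd.read_csv("TelcoCustomerChurn.csv")   # filename is an assumption
# Here a small inline sample stands in for the real file.
csv_data = """customerID,gender,tenure,MonthlyCharges,TotalCharges,Churn
0001-A,Male,12,29.85,358.2,No
0002-B,Female,1,56.95,56.95,Yes
0003-C,Male,24,42.30,1015.2,No"""
df = pd.read_csv(io.StringIO(csv_data))

print("Shape:", df.shape)   # (rows, columns)
print("Size:", df.size)     # rows * columns
```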


Key Observations:-


2. Data Cleansing:

Creating Duplicate Dataset for Future Use

This Dataset will be used to Automate the Data Cleansing process.

* Missing Value Treatment:

Checking for Null Values in the Attributes


Key Observations:-


Checking Data Type of Quantitative Attributes


Key Observations:-


Converting Datatype of TotalCharges to Numerical Float datatype
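In this dataset TotalCharges is typically read in as `object` because a few rows contain a blank string instead of a number; coercing with `pd.to_numeric` converts those entries to NaN so they can be treated as missing values. A minimal sketch with illustrative values:

```python
import pandas as pd

# A blank string among otherwise numeric values forces the 'object' dtype.
df = pd.DataFrame({"TotalCharges": ["29.85", " ", "1889.5", "108.15"]})

# errors="coerce" turns unparseable entries (the blank) into NaN.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

n_missing = df["TotalCharges"].isnull().sum()
print(df["TotalCharges"].dtype, "missing values:", n_missing)
```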


Key Observations:-


Checking for Null Values in the TotalCharges Attribute after Datatype Conversion


Key Observations:-


Removing Null Values in the TotalCharges Attribute after Datatype Conversion


Key Observations:-


* Convert Categorical Attributes to Continuous using relevant Functional Knowledge

Checking Data Types and Unique Values in every Categorical Attribute


Key Observations:-


Converting Categorical Attribute to Continuous form
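The binary Yes/No (and Male/Female) attributes map naturally to 1/0; a simple mapping sketch is shown below with illustrative values. Multi-level categoricals could instead be one-hot encoded with `pd.get_dummies` or label-encoded with `sklearn.preprocessing.LabelEncoder`.

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["Male", "Female", "Female", "Male"],
    "Partner": ["Yes", "No", "Yes", "No"],
    "Churn":   ["No", "Yes", "No", "No"],
})

# Map binary categories onto integers so models can consume them.
binary_map = {"Yes": 1, "No": 0, "Male": 1, "Female": 0}
for col in ["gender", "Partner", "Churn"]:
    df[col] = df[col].map(binary_map)

print(df.dtypes)
```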


Key Observations:-


* Drop Attribute/s If Required using relevant Functional Knowledge

Information about the Features

For further analysis, we first differentiate between different types of Attributes.

1. Qualitative Attributes:-

2. Quantitative Attributes:-

Dropping customerID Attribute

Since customerID does not carry useful information and is not helpful for our further process, we drop this attribute.


Key Observations:-


* Automate All the Above Steps
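The cleansing steps above can be automated by wrapping them in a single function, which is then run on the duplicate dataset created earlier. This is a sketch under the assumption that only TotalCharges needs numeric coercion and only binary Yes/No attributes need encoding; the real function would mirror whatever the manual steps did.

```python
import pandas as pd

def clean_telecom_data(df):
    """Automate the manual cleansing steps above (illustrative sketch)."""
    df = df.copy()
    # 1. Convert TotalCharges to numeric; blank strings become NaN.
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # 2. Missing value treatment: drop rows with missing TotalCharges.
    df = df.dropna(subset=["TotalCharges"])
    # 3. Encode binary Yes/No attributes as 1/0.
    for col in df.select_dtypes(include="object").columns:
        if col == "customerID":
            continue
        if set(df[col].unique()) <= {"Yes", "No"}:
            df[col] = df[col].map({"Yes": 1, "No": 0})
    # 4. Drop the identifier attribute.
    df = df.drop(columns=["customerID"])
    return df

sample = pd.DataFrame({
    "customerID": ["0001-A", "0002-B", "0003-C"],
    "Partner": ["Yes", "No", "Yes"],
    "TotalCharges": ["29.85", " ", "1889.5"],
    "Churn": ["No", "Yes", "No"],
})
cleaned = clean_telecom_data(sample)
print(cleaned.shape)
```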


Key Observations:-


Detailed Comparison of the Original, Manually Cleansed, and Automatically Cleansed Data


Key Observations:-


3. Data Analysis & Visualisation:

* Perform Detailed Statistical Analysis on the Data.

Brief Summary of Data

Checking Skewness of the data attributes

Checking Variance of the data attributes

Checking Correlation by plotting Heatmap for attributes
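The statistical checks above boil down to a handful of pandas one-liners; a sketch on illustrative values (the heatmap call is an assumption that seaborn is used for plotting):

```python
import pandas as pd

df = pd.DataFrame({
    "tenure":         [1, 12, 24, 48, 60, 2],
    "MonthlyCharges": [29.85, 56.95, 42.30, 89.10, 99.65, 53.85],
    "TotalCharges":   [29.85, 683.4, 1015.2, 4276.8, 5979.0, 107.7],
})

print(df.describe())   # summary statistics
print(df.skew())       # skewness of each attribute
print(df.var())        # variance of each attribute

corr = df.corr()       # correlation matrix
print(corr)            # plotted in the notebook, e.g. sns.heatmap(corr, annot=True)
```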


Key Observations:-


Analysis of Loyalty of Customers.

* Percentage of Loyal Customers

* Highest Tenure of Loyal Customer.

* Lowest Tenure of Loyal Customer.

* Average Tenure of Loyal and Not so Loyal Customer.

* Visualizing Relation between Tenure and Churn.


Key Observations:-


Churn of Customers based on Gender.

Here we check whether churn differs by gender.

Analysis of Gender of Customers who do not Churn:-

Analysis of Gender of Customers who Churn:-

* Visualizing Relation between Gender and Churn.


Key Observations:-


Churn of Customers based on SeniorCitizen.

Analysis of Senior Citizens among Customers who do not Churn:-

Analysis of Senior Citizens among Customers who Churn:-

Analysis of Churned Senior Citizens based on Gender:-

* Visualizing Relation between SeniorCitizen and Churn.


Key Observations:-


* Perform a Detailed Univariate, Bivariate and Multivariate Analysis with Appropriate detailed comments after Each Analysis.

Univariate Analysis

Univariate analysis is the simplest form of analyzing data. It involves only one variable.

Creating Functions for Plotting the Quantitative and Categorical Data for Univariate Analysis.

We will use these functions for easy analysis of individual attribute.
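The two helper functions might be sketched as below: one for quantitative attributes (histogram plus boxplot) and one for categorical attributes (bar chart of value counts). The exact plot types are an assumption; the notebook's own functions may differ.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

def plot_quantitative(df, col):
    """Histogram and boxplot for a numeric attribute."""
    fig, axes = plt.subplots(1, 2, figsize=(10, 4))
    axes[0].hist(df[col], bins=10)
    axes[0].set_title(f"Distribution of {col}")
    axes[1].boxplot(df[col])
    axes[1].set_title(f"Boxplot of {col}")
    return fig

def plot_categorical(df, col):
    """Bar chart of value counts for a categorical attribute."""
    counts = df[col].value_counts()
    fig, ax = plt.subplots(figsize=(5, 4))
    ax.bar(counts.index.astype(str), counts.values)
    ax.set_title(f"Counts of {col}")
    return fig

df = pd.DataFrame({"tenure": [1, 12, 24, 48, 60],
                   "gender": ["Male", "Female", "Female", "Male", "Male"]})
fig1 = plot_quantitative(df, "tenure")
fig2 = plot_categorical(df, "gender")
```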

Attribute 1: "gender"


Key Observations:-


Attribute 2: "SeniorCitizen"


Key Observations:-


Attribute 3: "Partner"


Key Observations:-


Attribute 4: "Dependents"


Key Observations:-


Attribute 5: "tenure"


Key Observations:-


Attribute 6: "PhoneService"


Key Observations:-


Attribute 7: "MultipleLines"


Key Observations:-


Attribute 8: "InternetService"


Key Observations:-


Attribute 9: "OnlineSecurity"


Key Observations:-


Attribute 10: "OnlineBackup"


Key Observations:-


Attribute 11: "DeviceProtection"


Key Observations:-


Attribute 12: "TechSupport"


Key Observations:-


Attribute 13: "StreamingTV"


Key Observations:-


Attribute 14: "StreamingMovies"


Key Observations:-


Attribute 15: "Contract"


Key Observations:-


Attribute 16: "PaperlessBilling"


Key Observations:-


Attribute 17: "PaymentMethod"


Key Observations:-


Attribute 18: "MonthlyCharges"


Key Observations:-


Attribute 19: "TotalCharges"


Key Observations:-


Attribute 20: "Churn"


Key Observations:-


Bivariate Analysis

Creating Functions for Plotting the Quantitative VS Categorical Data for Bivariate Analysis

Bivariate Analysis 1: "Tenure" VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 2: "MonthlyCharges" VS All Categorical Attributes


Key Observations:-


Bivariate Analysis 3: "TotalCharges" VS All Categorical Attributes


Key Observations:-


Multivariate Analysis

Multivariate analysis is performed to understand interactions between different fields in the dataset.

Multivariate Analysis : To Check Relation Between Quantitative Attributes


Key Observations:-


Multivariate Analysis : To check Density of Target Attribute in all other Quantitative Features


Key Observations:-


Multivariate Analysis : To Check Correlation


Key Observations:-


4. Data Pre-Processing:

Outlier Analysis

Here we will check for outliers in Quantitative data.


Key Observations:-


Feature Scaling Normalization
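Min-max normalisation rescales each quantitative attribute to the [0, 1] range so that no single attribute dominates distance- or gradient-based models. A minimal sketch with illustrative values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"tenure": [1, 12, 24, 48, 60],
                   "MonthlyCharges": [29.85, 56.95, 42.30, 89.10, 99.65]})

# Each column is rescaled independently: (x - min) / (max - min).
scaler = MinMaxScaler()
df[df.columns] = scaler.fit_transform(df[df.columns])
print(df.min().tolist(), df.max().tolist())
```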


Key Observations:-


* Segregate Predictors VS Target Attributes

By separating the predictor and target attributes, we can perform further operations more easily.


Key Observations:-


* Check for Target Balancing and Fix it if found Imbalanced.


Key Observations:-


Fixing Target Imbalance by Synthetic Minority Oversampling Technique (SMOTE)

SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
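In practice this is usually done with `imblearn.over_sampling.SMOTE` and its `fit_resample` method; the convex-combination step described above can also be sketched directly with NumPy and scikit-learn's nearest-neighbour search (function name and parameters here are illustrative):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(minority_X, k=3, n_new=5, seed=0):
    """Generate synthetic minority samples as described above: pick a point a,
    one of its k nearest minority neighbours b, and return the convex
    combination a + u * (b - a) for a random u in [0, 1)."""
    rng = np.random.default_rng(seed)
    # +1 neighbour because each point's nearest neighbour is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_X)
    _, idx = nn.kneighbors(minority_X)
    synthetic = []
    for _ in range(n_new):
        a_i = rng.integers(len(minority_X))
        b_i = rng.choice(idx[a_i][1:])   # skip the point itself
        u = rng.random()
        a, b = minority_X[a_i], minority_X[b_i]
        synthetic.append(a + u * (b - a))
    return np.array(synthetic)

minority = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [2.5, 2.5]])
new_points = smote_sample(minority, k=2, n_new=3)
print(new_points.shape)
```

Because each synthetic point lies on a line segment between two real minority points, it stays inside the region the minority class already occupies in feature space.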


Key Observations:-


* Perform Train-Test Split.
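A split sketch on synthetic stand-in data; stratifying on the target keeps the class ratio identical in train and test, which matters after the balancing step above. The 70/30 split ratio is an assumption.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)   # balanced target, as after SMOTE

# stratify=y preserves the class proportions in both partitions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)
print(X_train.shape, X_test.shape)
```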


Key Observations:-


* Check if the Train and Test data have similar Statistical Characteristics when compared with Original data.

Comparison of Train data with Original data


Key Observations:-


Comparison of Test data with Original data


Key Observations:-


5. Model Training, Testing and Tuning:

* Train and Test all Ensemble Models taught in the learning module.
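The five models built below can be trained and scored in one loop; this sketch uses a synthetic stand-in for the churn data (in the notebook, `X_train` etc. come from the pre-processing step above), and all hyperparameters are left at their defaults.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (BaggingClassifier, AdaBoostClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)

# Synthetic stand-in for the pre-processed churn data.
X, y = make_classification(n_samples=300, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "DecisionTree":     DecisionTreeClassifier(random_state=42),
    "Bagging":          BaggingClassifier(random_state=42),
    "AdaBoost":         AdaBoostClassifier(random_state=42),
    "GradientBoosting": GradientBoostingClassifier(random_state=42),
    "RandomForest":     RandomForestClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name:18s} train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```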

Building Decision Tree Classifier


Key Observations:-


Building Bagging Classifier


Key Observations:-


Building AdaBoost Classifier


Key Observations:-


Building Gradient Boosting Classifier


Key Observations:-


Building RandomForest Classifier


Key Observations:-


Building Voting Classifier

The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use a majority vote or the average predicted probabilities (soft vote) to predict the class labels.

Here we build,

  1. LogisticRegression()
  2. SVC()
  3. DecisionTreeClassifier()

All the models were pre-checked to give the best accuracies. We then use these models to build our Voting Classifier.
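A minimal sketch of the three-estimator Voting Classifier on synthetic stand-in data, assuming hard (majority) voting; soft voting would instead average predicted probabilities and requires `probability=True` on the SVC.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the pre-processed churn data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hard voting: each base model casts one vote for the predicted class.
voting = VotingClassifier(estimators=[
    ("lr",  LogisticRegression(max_iter=1000)),
    ("svc", SVC()),
    ("dt",  DecisionTreeClassifier(random_state=0)),
], voting="hard")

voting.fit(X, y)
train_acc = voting.score(X, y)
print(f"Voting classifier training accuracy: {train_acc:.3f}")
```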


Key Observations:-


* Display the Classification Accuracies for Train and Test Data.


Key Observations:-


* Apply all the possible Tuning Techniques to Train the best model for the given data.

List of different Models Applied to our Data:-

  1. DecisionTreeClassifier('gini')
  2. DecisionTreeClassifier('entropy')
  3. RandomForestClassifier('gini')
  4. RandomForestClassifier('entropy')
  5. LogisticRegression()
  6. GaussianNB()
  7. SVC('rbf')
  8. SVC('linear')
  9. KNeighborsClassifier()

All the models were pre-checked to give the best accuracies.
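One common tuning technique for such models is an exhaustive cross-validated grid search; the sketch below tunes a RandomForest over a small hypothetical grid (the real search would cover the full parameter ranges of each model listed above).

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the pre-processed churn data.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid; every combination is evaluated with 3-fold CV.
param_grid = {
    "n_estimators": [50, 100],
    "criterion": ["gini", "entropy"],
    "max_depth": [None, 5],
}
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, scoring="accuracy")
grid.fit(X, y)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_, 3))
```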


Key Observations:-


* Display and Compare all the Models designed with their Train and Test accuracies.

Displaying Accuracies of Train and Test data for above Trained Models

Comparing Accuracies of Train and Test data for All the Trained Models so far.


Key Observations:-


* Select the Final Best Trained Model along with your detailed comments for selecting this model.

Based on the above results, the RandomForest Classifier (gini) has the highest accuracies for both train and test data.

Checking Classification Report for RandomForest Classifier(gini)

Checking Feature Importance of RandomForest Classifier(gini)


Comments:-

By observing above data, we select the RandomForest Classifier (gini) as our Final Best Trained Model.

* Pickle the selected model for future use.

Pickle is the standard way of serializing objects in Python.

Method 1: Saving the Model in SAV File

Here we create a SAV file and save our Model in it.
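The save-and-reload round trip can be sketched as follows; a small LogisticRegression stands in for the selected RandomForest model, and the `.sav` filename is illustrative.

```python
import os
import pickle
import tempfile
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Stand-in for the selected final model.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Serialise the trained model to a .sav file.
path = os.path.join(tempfile.gettempdir(), "final_model.sav")
with open(path, "wb") as f:
    pickle.dump(model, f)

# Load it back and confirm the predictions survive the round trip.
with open(path, "rb") as f:
    loaded = pickle.load(f)
same = bool((loaded.predict(X) == model.predict(X)).all())
print("Round-trip predictions identical:", same)
```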

Method 2: Saving the Model in a Variable

In case the above method fails due to an error in loading the file, you can use this method.

6. Conclusion and Improvement:

* Write your Conclusion on the Results.

* Provide detailed suggestions for improving the quality, quantity, variety, velocity, veracity, etc. of the data points collected by the telecom operator, to enable better data analysis in future.

Closing Sentence:- The predictions made by our models will help the company understand the pain points and patterns of customer churn and will sharpen its focus on strategising customer retention.

--------------------- End of AIML MODULE PROJECT 3 ---------------------

--------------------- THANK YOU 😊 ---------------------